Definition of p value

The p-value is a key quantity in hypothesis testing: it measures how compatible the observed data are with a null hypothesis.

In this presentation, we will:

  • Explain the meaning and importance of the p-value.
  • Work through an example using actual data on real estate prices.
  • Carry out a hypothesis test to establish whether waterfront properties have higher prices than non-waterfront properties.
  • Use the following tools and visualizations: statistical testing in R, summary statistics (mean, standard deviation), feature distributions, correlations between features, and principal components analysis.

What is the p-value?

The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true.

It helps answer the question: could this result be explained by chance alone, or is there truly an effect?

Key Points:

  1. Small p-value (e.g., ≤ 0.05):
    • Strong evidence against the null hypothesis.
    • We may reject the null hypothesis.
  2. Large p-value (> 0.05):
    • Weak evidence against the null hypothesis.
    • Fail to reject the null hypothesis.
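To make the 0.05 threshold concrete, here is a minimal simulation sketch that is not part of the original analysis. It assumes normally distributed data, hypothetical sample sizes of 30 per group, and a null hypothesis that is actually true, and it checks how often the rule "reject when p ≤ 0.05" fires purely by chance.

```r
# Simulate many two-sample experiments in which the null hypothesis is TRUE:
# both groups come from the same normal distribution (an assumption made
# for illustration only).
set.seed(42)
n_sims <- 2000

p_values <- replicate(n_sims, {
  group_a <- rnorm(30, mean = 100, sd = 15)
  group_b <- rnorm(30, mean = 100, sd = 15)
  t.test(group_a, group_b)$p.value  # Welch two-sample t-test (R's default)
})

# Share of experiments that would be declared "significant" at the 0.05 level
false_positive_rate <- mean(p_values <= 0.05)
false_positive_rate
```

The share should come out close to 0.05: even when the null hypothesis holds, a small p-value still turns up in roughly 5% of experiments, which is what the significance level controls.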

Formula for Test Statistic:

\[ T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

Where:

  • \(\bar{X}_1, \bar{X}_2\): Sample means.
  • \(s_1^2, s_2^2\): Sample variances.
  • \(n_1, n_2\): Sample sizes.
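The formula can be checked numerically. The sketch below implements it term by term and compares the result with R's built-in t.test(), which uses this unequal-variance (Welch) form by default. The function name welch_t and the two toy samples are hypothetical and serve only to exercise the formula.

```r
# Welch test statistic, following the formula above term by term.
# welch_t is a hypothetical helper name, not part of any package.
welch_t <- function(x1, x2) {
  (mean(x1) - mean(x2)) / sqrt(var(x1) / length(x1) + var(x2) / length(x2))
}

# Hypothetical toy samples, used only to exercise the function
x1 <- c(12.1, 9.8, 11.4, 10.9, 13.2)
x2 <- c(8.7, 9.9, 7.5, 10.1, 8.2, 9.0)

t_manual  <- welch_t(x1, x2)
t_builtin <- unname(t.test(x1, x2)$statistic)  # var.equal = FALSE by default

# Compare the hand-computed statistic with the built-in one
all.equal(t_manual, t_builtin)
```

Note that the sign of T depends on which sample is placed first, which matters when choosing the direction of a one-sided test.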

Real-World Example: Real Estate Data

Source: https://www.kaggle.com/code/fareedalianwar/house-price/input?select=data.csv

  • Dataset Overview

The dataset contains property sales records, including features such as:

  • price: Sale price of the property.
  • bedrooms, bathrooms: Number of bedrooms and bathrooms.
  • sqft_living: Square footage of the living area.
  • waterfront: Whether the property has a waterfront view (1 = Yes, 0 = No).
  • And more…

Objective

  • Research question: Is the average price of properties with a waterfront view higher than the average price of properties without one?

Hypotheses

  1. Null Hypothesis (\(H_0\)): There is no difference between the average price of waterfront houses and the average price of other houses.

  2. Alternative Hypothesis (\(H_1\)): The average price of waterfront houses is higher than the average price of houses that are not on the waterfront.

Data Exploration

Key Steps in Exploring the Data

  1. Summary Statistics:
    • Compare the average property prices for waterfront and non-waterfront properties.
    • Look at the spread of data (e.g., standard deviation, range).
  2. Visualizing the Data:
    • Plot price distributions for both groups to observe patterns.

Summary Statistics Formulas

  • Mean: \(\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\)
  • Standard Deviation (sample): \(s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}\)
  • Median: The middle value in the dataset when ordered.
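As a small illustration of these formulas, the sketch below computes them by hand on a hypothetical toy vector and compares them with R's built-ins. One detail worth knowing: sd() divides by n - 1 (the sample form), so both denominators are shown to make the difference visible.

```r
# Hypothetical toy prices, for illustration only
x <- c(250000, 310000, 295000, 480000, 1200000)
n <- length(x)

mean_manual <- sum(x) / n
sd_sample   <- sqrt(sum((x - mean_manual)^2) / (n - 1))  # denominator n - 1
sd_pop      <- sqrt(sum((x - mean_manual)^2) / n)        # denominator n

# Built-in equivalents
c(mean = mean(x), median = median(x), sd = sd(x))

# sd() matches the n - 1 version, not the n version
all.equal(sd(x), sd_sample)
```

The single large value pulls the mean well above the median, the same pattern that appears in skewed price data.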

Summary Statistics Example (R Code):

# Load necessary libraries
library(dplyr)
data <- read.csv("data.csv")
# Calculate summary statistics
summary_stats <- data %>%
  group_by(waterfront) %>%
  summarize(
    mean_price = mean(price, na.rm = TRUE),
    median_price = median(price, na.rm = TRUE),
    sd_price = sd(price, na.rm = TRUE),
    count = n()
  )

Results

print(summary_stats)
## # A tibble: 2 × 5
##   waterfront mean_price median_price sd_price count
##        <int>      <dbl>        <dbl>    <dbl> <int>
## 1          0    545462.       460000  547781.  4567
## 2          1   1451621.       988500 1425994.    33

Explanation:

Waterfront Properties: These have a higher average price ($1,451,621) than non-waterfront properties ($545,462), as well as a larger standard deviation. In both groups the mean is well above the median, which points to right-skewed prices.

Sample Size: There are only 33 waterfront properties compared to 4,567 non-waterfront properties, so the waterfront estimates are less precise.

The two groups clearly differ in their summary statistics; the hypothesis test checks whether the difference in means is larger than chance alone would explain.

Hypothesis Testing: Two-Sample t-Test

Objective:

We use a two-sample t-test to compare the mean price of waterfront properties with the mean price of non-waterfront properties.

Null Hypothesis (\(H_0\)):

The mean price of waterfront properties is equal to the mean price of non-waterfront properties.
\(H_0: \mu_{\text{waterfront}} = \mu_{\text{non-waterfront}}\)

Alternative Hypothesis (\(H_1\)):

The mean price of waterfront properties is higher than the mean price of non-waterfront properties.
\(H_1: \mu_{\text{waterfront}} > \mu_{\text{non-waterfront}}\)

We will perform a one-tailed t-test to test this hypothesis.

Code to Perform the Two-Sample t-Test

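# Perform the t-test
# Note: with the formula interface, groups are ordered by level (0, then 1),
# so alternative = "greater" tests mean(group 0) > mean(group 1).
t_test_result <- t.test(price ~ waterfront,
      data = data, alternative = "greater")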

# Display the results
t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  price by waterfront
## t = -3.6485, df = 32.068, p-value = 0.9995
## alternative hypothesis: true difference in means between group 0 and group 1 is greater than 0
## 95 percent confidence interval:
##  -1326837      Inf
## sample estimates:
## mean in group 0 mean in group 1 
##        545462.3       1451621.2
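As a cross-check, the reported statistic can be approximately reproduced from the summary statistics alone. The sketch below is not the original analysis: it plugs the printed group means, standard deviations, and counts into the Welch formula and evaluates the one-sided p-value in both directions. Because the standard deviations are rounded in the printed table, agreement with the output is only approximate.

```r
# Summary statistics copied from the printed output
# (group 0 = non-waterfront, group 1 = waterfront)
m0 <- 545462.3;  s0 <- 547781;  n0 <- 4567
m1 <- 1451621.2; s1 <- 1425994; n1 <- 33

# Welch standard error terms and test statistic (group 0 minus group 1)
v0 <- s0^2 / n0
v1 <- s1^2 / n1
t_stat <- (m0 - m1) / sqrt(v0 + v1)

# Welch-Satterthwaite degrees of freedom
df <- (v0 + v1)^2 / (v0^2 / (n0 - 1) + v1^2 / (n1 - 1))

# One-sided p-values for the two possible directions
p_greater <- pt(t_stat, df, lower.tail = FALSE)  # H1: mean(group 0) > mean(group 1)
p_less    <- pt(t_stat, df)                      # H1: mean(group 0) < mean(group 1)

round(c(t = t_stat, df = df, p_greater = p_greater, p_less = p_less), 4)
```

The two one-sided p-values always sum to 1, so a p-value near 1 in one direction implies a p-value near 0 in the other.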

t-Test Explanation

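  • t-statistic: -3.6485

The statistic is computed as the mean of group 0 (non-waterfront) minus the mean of group 1 (waterfront). A negative value therefore means that waterfront properties are more expensive on average, in line with the group means of $1,451,621 and $545,462.

  • p-value: 0.9995

With alternative = "greater", R tested whether the non-waterfront mean exceeds the waterfront mean, which is the opposite direction from \(H_1\). The large p-value only says that there is no evidence that non-waterfront properties are more expensive. For the intended direction (waterfront higher), the one-sided p-value is the complement, 1 - 0.9995 = 0.0005, which is well below 0.05. Calling t.test() with alternative = "less" tests that direction directly.

  • 95% Confidence Interval: [-1,326,837, Inf]

    This one-sided interval refers to the difference (non-waterfront minus waterfront) and belongs to the "greater" alternative, so it gives only a lower bound for that difference. The point estimate of the difference is 545,462.3 - 1,451,621.2 = -906,158.9, meaning waterfront properties cost about $906,000 more on average in this sample.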
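The direction of a one-sided test with the formula interface is easy to get backwards, so here is a minimal sketch of how it behaves. It uses simulated data (not the Kaggle dataset); the data frame toy, its group sizes, and its price parameters are all hypothetical.

```r
# Hypothetical simulated data (NOT the Kaggle dataset): group 1 is priced
# higher on average, mimicking the waterfront pattern.
set.seed(1)
toy <- data.frame(
  waterfront = rep(c(0, 1), times = c(200, 20)),
  price      = c(rnorm(200, mean = 500000, sd = 100000),
                 rnorm(20,  mean = 900000, sd = 200000))
)

# The formula interface takes the difference as (level 0) - (level 1)
p_greater <- t.test(price ~ waterfront, data = toy, alternative = "greater")$p.value
p_less    <- t.test(price ~ waterfront, data = toy, alternative = "less")$p.value

# Equivalent: put the waterfront group first, so "greater" means waterfront > other
toy$group <- factor(toy$waterfront, levels = c(1, 0))
p_releveled <- t.test(price ~ group, data = toy, alternative = "greater")$p.value

c(p_greater = p_greater, p_less = p_less, p_releveled = p_releveled)
```

Either option works for a hypothesis of the form "waterfront is higher": keep the default level order and use alternative = "less", or reorder the factor levels and keep alternative = "greater".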

Conclusion

  • Based on the results of the t-test:

    • The reported p-value of 0.9995 belongs to the alternative "non-waterfront mean is greater", not to \(H_1\).
    • For \(H_1\) (waterfront mean is greater), the one-sided p-value is about 1 - 0.9995 = 0.0005, which is much smaller than the significance level of 0.05.
    • We reject the null hypothesis that the mean price of waterfront and non-waterfront properties is the same.
  • Interpretation: The data provide strong evidence that waterfront properties have higher average prices than non-waterfront properties in this dataset.

  • Caveat: The waterfront group contains only 33 properties and prices are right-skewed (the means are well above the medians), so the size of the difference is estimated imprecisely.

3D Scatter Plot (plotly)

library(plotly)
plot_ly(data = data, x = ~sqft_living, y = ~price, z = ~waterfront, type = 'scatter3d', mode = 'markers')

Box Plot: Price by Waterfront Status

library(ggplot2)

# Price distribution plot
ggplot(data, aes(x = factor(waterfront), y = price)) +
  geom_boxplot() +
  labs(title = "Price Distribution: Waterfront vs Non-Waterfront")

Scatter Plot: Price vs Square Footage

# Scatter plot of sqft vs price
ggplot(data, aes(x = sqft_living, y = price)) +
  geom_point(aes(color = factor(waterfront))) +
  labs(title = "Price vs Square Footage")

Final Thoughts

  • Key Takeaways:

    The p-value is one of the main tools of hypothesis testing because it lets us gauge how much evidence there is against the null hypothesis.

    In our real estate example, the test as coded returned a p-value of 0.9995 because the direction of the alternative did not match the group order R uses (group 0 minus group 1). Read in the intended direction, the p-value is about 0.0005, which supports the hypothesis that waterfront properties are priced higher than non-waterfront properties. This shows how much correct setup and interpretation of statistical results matter in applications such as real estate.

  • Implications:

The example underlines the importance of checking the data, the assumptions, and the direction of a one-sided test before drawing conclusions. It also shows that statistical tools can support better-informed decisions in many industries, including real estate.

Thank you for your attention!